Most real-world high-dimensional datasets lie close to a much lower-dimensional manifold embedded within the high-dimensional space.
Approaches to dimensionality reduction:
- distance-based
- linear transformation, matrix factorization (robust to rotation and scaling)
- graph layout
Representative methods: PCA, t-SNE, UMAP
maximum variance
minimum projection error
PCA can be defined as the orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized (Hotelling, 1933). Equivalently, it can be defined as the linear projection that minimizes the average projection cost, defined as the mean squared distance between the data points and their projections (Pearson, 1901).

Theorem 1 An $n \times n$ matrix $A$ is diagonalizable if and only if $A$ has $n$ linearly independent eigenvectors.
$$ A = P D P^{-1} \text{ where } D = \text{diag}(\lambda_j)$$
Theorem 2 Real symmetric matrices are diagonalizable by orthogonal matrices.
$$ A = Q D Q^\intercal \text{ where } QQ^\intercal = Q^\intercal Q = I, \ D = \text{diag}(\lambda_j)$$
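Theorem 2 can be checked numerically with NumPy: `np.linalg.eigh` returns the orthonormal eigenvectors of a symmetric matrix, from which $A = Q D Q^\intercal$ is recovered. A minimal sketch (matrix size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                  # symmetrize

eigvals, Q = np.linalg.eigh(A)     # Q has orthonormal columns
D = np.diag(eigvals)

# A = Q D Q^T and Q^T Q = I, up to floating-point error
assert np.allclose(A, Q @ D @ Q.T)
assert np.allclose(Q.T @ Q, np.eye(4))
```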
$$ \mathbf{w} = \arg \max_\mathbf{w} \frac{\mathbf{w}^\intercal X^\intercal X \mathbf{w}}{\mathbf{w}^\intercal \mathbf{w}}$$
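The maximizer of this Rayleigh quotient is the top eigenvector of $X^\intercal X$ (for centered $X$). A minimal NumPy sketch that checks the variance along that eigenvector dominates the variance along random directions (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data; the first principal axis is the
# direction of maximum variance.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                       # center the data

eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # ascending order
w = eigvecs[:, -1]                           # top eigenvector (unit norm)

# Variance along w is at least the variance along any other direction.
var_w = np.var(X @ w)
for _ in range(100):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    assert var_w >= np.var(X @ v) - 1e-12
```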
sklearn.decomposition.PCA(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', n_oversamples=10, power_iteration_normalizer='auto', random_state=None)
When n_components=None, PCA identifies as many components as possible: n_components = min(n_samples, n_features) (with svd_solver='arpack' it must be strictly smaller, i.e. min(n_samples, n_features) - 1). explained_variance_ratio_ reports the fraction of the total variance explained by each principal component.
>>> pca.explained_variance_ratio_
array([0.72962445, 0.22850762, 0.03668922])
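A minimal sketch of using explained_variance_ratio_ to pick the number of components that preserves, say, 95% of the variance (the data here is an illustrative low-rank matrix plus noise); note that sklearn also accepts a float, e.g. PCA(n_components=0.95), to do this directly:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Rank-3 data plus small noise: most variance lives in 3 directions.
X = (rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
     + 0.01 * rng.normal(size=(200, 10)))

pca = PCA(n_components=None).fit(X)          # keep all components
cum = np.cumsum(pca.explained_variance_ratio_)
d = int(np.argmax(cum >= 0.95)) + 1          # smallest d reaching 95%
print(d)
```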
sklearn.decomposition.KernelPCA(n_components=None, *, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False, eigen_solver='auto', tol=0, max_iter=None, iterated_power='auto', remove_zero_eig=False, random_state=None, copy_X=True, n_jobs=None)
kernel: 'linear' (default), 'poly', 'rbf', 'sigmoid', 'cosine', 'precomputed'
gamma: Kernel coefficient for rbf, poly and sigmoid kernels. Ignored by other kernels. If gamma is None, then it is set to 1/n_features
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np

def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_examples, n_features]
    gamma: float
        Tuning parameter of the RBF kernel
    n_components: int
        Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_examples, k_features]
        Projected dataset
    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')
    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)
    # Compute the symmetric kernel matrix.
    K = np.exp(-gamma * mat_sq_dists)
    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)
    # Obtain eigenpairs from the centered kernel matrix;
    # scipy.linalg.eigh returns them in ascending order.
    eigvals, eigvecs = eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Collect the top k eigenvectors (projected examples).
    X_pc = np.column_stack([eigvecs[:, i]
                            for i in range(n_components)])
    return X_pc
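The built-in sklearn KernelPCA achieves the same effect. A minimal sketch on the half-moon toy dataset, where a linear PCA cannot unfold the two classes (gamma=15 is an illustrative value for this dataset):

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Two interleaving half-moons: not linearly separable in input space.
X, y = make_moons(n_samples=100, noise=0.05, random_state=0)

# RBF kernel PCA projects the data onto nonlinear components.
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (100, 2)
```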
For datasets too large to fit in memory, use IncrementalPCA with the partial_fit method. np.memmap lets you treat space on disk as if it were memory; it combines well with IncrementalPCA.
https://lovit.github.io/nlp/representation/2018/09/28/mds_isomap_lle/
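A minimal sketch of incremental fitting in batches (array sizes are illustrative; in practice each batch could be a slice of an np.memmap array backed by a file on disk):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

ipca = IncrementalPCA(n_components=5)
# Feed the data in batches, as if streaming it from disk.
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (1000, 5)
```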
sklearn.manifold.MDS(n_components=2, *, metric=True, n_init=4, max_iter=300, verbose=0, eps=0.001, n_jobs=None, random_state=None, dissimilarity='euclidean')
sklearn.manifold.LocallyLinearEmbedding(*, n_neighbors=5, n_components=2, reg=0.001, eigen_solver='auto', tol=1e-06, max_iter=100, method='standard', hessian_tol=0.0001, modified_tol=1e-12, neighbors_algorithm='auto', random_state=None, n_jobs=None)
sklearn.manifold.Isomap(*, n_neighbors=5, radius=None, n_components=2, eigen_solver='auto', tol=0, max_iter=None, path_method='auto', neighbors_algorithm='auto', n_jobs=None, metric='minkowski', p=2, metric_params=None)
t-SNE computes joint probabilities of the observations in the input space and maps this probability distribution to a lower-dimensional feature space.
In the first step, conditional distributions are computed under a Gaussian assumption; in the second step, a Student's t-distribution with one degree of freedom (a Cauchy distribution) is used in the low-dimensional space, and the Kullback–Leibler divergence between the two distributions is minimized.
Because the KL divergence is asymmetric, when $p_{ij}$ is small the cost is hardly affected by $q_{ij}$.
sklearn.manifold.TSNE(n_components=2, *, perplexity=30.0, early_exaggeration=12.0, learning_rate=200.0, n_iter=1000, n_iter_without_progress=300, min_grad_norm=1e-07, metric='euclidean', init='random', verbose=0, random_state=None, method='barnes_hut', angle=0.5, n_jobs=None, square_distances='legacy')
It is highly recommended to use another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (e.g. 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples. (TSNE documentation)
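A minimal sketch of this recommended pipeline, PCA down to 50 dimensions followed by t-SNE to 2 (the random data stands in for a real high-dimensional dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # stand-in for high-dimensional data

# Reduce to 50 dimensions first, as the documentation recommends,
# then run t-SNE on the result.
X_50 = PCA(n_components=50).fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=30.0, init='random',
            random_state=0).fit_transform(X_50)
print(X_2d.shape)  # (200, 2)
```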
UMAP assumes the data are uniformly distributed on a manifold, builds an approximation of that manifold, and projects the approximation to a lower-dimensional space.
The similarity between samples $i$ and $j$ in the input space is defined as
$$ p_{i|j} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right) $$
There are various choices for the metric $d(\cdot)$.
$\rho_i$ is the distance from sample $i$ to its nearest sample. The distance scale is therefore governed by the local distribution of the samples and can differ at every observation. As a result, regardless of the actual distance, the adjusted distance to the nearest sample is always 0, so its similarity is 1, which guarantees local connectivity. If two points are each other's nearest samples, then $p_{i|j} = p_{j|i}$.


min_dist specifies the minimum distance between embedded points. The smaller min_dist is, the stronger the clustering effect; larger values make it easier to see the overall structure.
!conda install -c conda-forge umap-learn
from umap import UMAP

umap = UMAP(n_neighbors=15,        # size of the local neighborhood used for manifold approximation
            n_components=2,        # dimension of the space to embed into
            metric='euclidean',
            n_epochs=None,         # larger values result in more accurate embeddings
            learning_rate=1.0,     # initial learning rate
            min_dist=0.1,          # effective minimum distance between embedded points
            local_connectivity=1,  # local connectivity required
            random_state=None)